In [64]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
In [65]:
titanic = pd.read_csv('titanic_data.csv')
titanic.describe()
Out[65]:
In [66]:
titanic.head()
Out[66]:
Using the fare column for this question is complicated, because some passengers paid for their companions (family members, maids, nurses...), and prices varied greatly even for the same class.
The passenger class is therefore a more straightforward measure. For each passenger, we need the class and the port of embarkation. Unfortunately, we do not always know the port, so we will drop the rows where it is missing:
In [67]:
class_port = titanic[['PassengerId', 'Pclass', 'Embarked']]
print class_port.isnull().any()
print
class_port = class_port.dropna()
print class_port.isnull().any()
We need the number of passengers for each combination of port of embarkation and class:
In [68]:
passengers_by_port_class = class_port.groupby(['Embarked', 'Pclass'], as_index=False).count()
print passengers_by_port_class
and the total number of passengers for each port:
In [69]:
passengers_by_port_class['Total'] = passengers_by_port_class.groupby('Embarked')['PassengerId'].transform(sum)
print passengers_by_port_class
Now we can calculate the percentage of each class per port:
In [70]:
passengers_by_port_class['Percent'] = passengers_by_port_class.PassengerId * 100 / passengers_by_port_class.Total
print passengers_by_port_class
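As a cross-check on the three steps above (count, total per port, percentage), the same table can be sketched in one call with `pd.crosstab`. The data below is a small made-up frame, not the Titanic file, and the names `toy` and `percent_by_port` are only for illustration:

```python
import pandas as pd

# Toy data invented for illustration (not the Titanic file).
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'C', 'Q', 'Q', 'S', 'S', 'S', 'S'],
    'Pclass':   [1,   1,   2,   3,   3,   3,   1,   2,   3,   3],
})

# normalize='index' divides every row by its row total, so each port's
# shares sum to 1; multiplying by 100 turns them into percentages.
percent_by_port = pd.crosstab(toy['Embarked'], toy['Pclass'], normalize='index') * 100
print(percent_by_port)
```

Each row of the result is one port and each column one class, which is also the layout the bar chart needs.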
We can use a bar chart to visualize the results:
In [71]:
%matplotlib inline
c_percents = passengers_by_port_class.loc[passengers_by_port_class['Embarked'] == 'C']['Percent']
q_percents = passengers_by_port_class.loc[passengers_by_port_class['Embarked'] == 'Q']['Percent']
s_percents = passengers_by_port_class.loc[passengers_by_port_class['Embarked'] == 'S']['Percent']
class_port_percents = pd.DataFrame.from_dict({'Cherbourg':c_percents.values, 'Queenstown':q_percents.values, 'Southampton':s_percents.values}, orient='index')
class_port_percents.columns = ['1st', '2nd', '3rd']
ax = class_port_percents.plot(kind='bar', rot=0, title='Percentage of passengers in each class by port of embarkation')
ax.set_xlabel('Port')
Out[71]:
The graph shows that more than 90% of the passengers who embarked at Queenstown, Ireland, were third class passengers, in contrast with the other ports, where third class made up 40-55% of the passengers.
The Maddison Project Database shows that Ireland had an estimated GDP per capita of 2,736 (in 1990 International Geary-Khamis dollars) in 1913 (there is no data for 1912), while in 1912 the UK had a GDP per capita of 4,762 and France 3,514.
Our passenger data agrees with the Maddison Project in the case of Ireland: it was a poor country compared to Europe's average in the year the Titanic sank. However, our data cannot explain why the share of 1st class passengers at Cherbourg is double that at Southampton, given that the UK was richer than France at the time. Maybe the Cherbourg region was wealthier than the French average, or Southampton was especially poor, but we do not have the data to check this.
Of course, there are other problems. For instance, we do not know whether the Titanic passengers are a representative sample of the population of these regions.
In [72]:
sex_age_surv = titanic[['PassengerId', 'Sex', 'Age', 'Survived']]
sex_age_surv_count = sex_age_surv.count()
print sex_age_surv_count
We do not have the age for every passenger, so we will have to take this into account when analyzing the data. The other columns are complete.
First, we will analyze the effect of sex alone on survival. We can use a bar chart with the percentage of survivors for each sex and for the whole sample, as we did in the previous question for class and port.
In [73]:
sex_surv = sex_age_surv[['PassengerId', 'Sex', 'Survived']]
sex_surv_count = sex_surv[['PassengerId', 'Sex']].groupby('Sex').count()
print sex_surv_count
Because the Survived column can only contain 1 for yes and 0 for no, we can count the survivors of each sex by grouping by sex and summing the Survived column.
In [74]:
sex_surv_count['Survived'] = sex_surv[['Survived', 'Sex']].groupby('Sex').sum()
print sex_surv_count
In [75]:
totals = pd.Series(sex_surv_count.sum(), name='total')
sex_surv_count = sex_surv_count.append(totals)
sex_surv_count['SurvPerc'] = sex_surv_count.Survived * 100 / sex_surv_count.PassengerId
print sex_surv_count
In [76]:
sex_surv_count['SurvPerc'].plot(kind='bar', rot=0, title='Percentage of survivors by sex')
Out[76]:
The survival rate is strongly skewed towards women: nearly three quarters of the women survived, compared to a bit less than one fifth of the men aboard the Titanic. The total survival rate is low because there were more men than women aboard.
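Since Survived is coded as 0 or 1, the mean of the column within a group is already the survival proportion, so the per-sex rates can be cross-checked without building the counts first. A minimal sketch on made-up data (the frame `toy` and its values are invented for illustration):

```python
import pandas as pd

# Toy data invented for illustration (not the Titanic file).
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'female',
                 'male', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0, 0],
})

# Mean of a 0/1 column = fraction of ones, i.e. the survival rate.
rate_by_sex = toy.groupby('Sex')['Survived'].mean() * 100
overall_rate = toy['Survived'].mean() * 100

print(rate_by_sex)
print(overall_rate)
```

In this toy frame the overall rate sits closer to the male rate simply because there are more men, which is the same weighting effect described above.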
Now we will analyze age. We choose to drop the passengers for whom we do not know the age. Because age is a continuous variable, we also have to group the passengers in age bins in order to have meaningful statistics:
In [77]:
age_surv = sex_age_surv[['PassengerId', 'Age', 'Survived']].dropna()
print age_surv.Age.max()
age_surv['bin'] = pd.cut(age_surv['Age'],np.arange(0,90,10), right=False)
print age_surv.head()
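One detail of this binning is worth checking on small made-up data: with `right=False` every bin is closed on the left and open on the right, so a value equal to the last edge falls outside all bins and gets NaN. In the sketch below the ages are invented; only the edges `np.arange(0, 90, 10)` are taken from the cell above. If the maximum age printed above happens to equal 80, that passenger would be left out of the binned statistics, and adding one more edge would keep them.

```python
import numpy as np
import pandas as pd

# Invented ages chosen to sit on or near bin edges.
ages = pd.Series([0.5, 9.9, 10, 79, 80])

# Edges 0, 10, ..., 80 give the bins [0, 10), [10, 20), ..., [70, 80).
left_closed = pd.cut(ages, np.arange(0, 90, 10), right=False)

# One extra edge (90) adds the bin [80, 90), which captures the value 80.
wider = pd.cut(ages, np.arange(0, 100, 10), right=False)

print(left_closed)
print(wider)
```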
We can now calculate the number of passengers in every age bin and how many of them survived:
In [78]:
age_surv_count = age_surv[['PassengerId', 'bin']].groupby('bin').count()
age_surv_count['Survived'] = age_surv[['Survived', 'bin']].groupby('bin').sum()
age_surv_count['SurvPerc'] = age_surv_count.Survived * 100 / age_surv_count.PassengerId
print age_surv_count
In [79]:
ax = age_surv_count['SurvPerc'].plot(kind='bar', rot=0, title='Percentage of survivors by age')
ax.set_xlabel('Age')
Out[79]:
The graph shows a survival rate of 60% for children below 10 years old. For the rest of the age groups, the survival rate oscillates between 30% and 45%. The exception is the oldest group, with ages between 70 and 80, none of whom survived. But there were only 6 of them, so it is hard to know whether there was an underlying cause or just randomness.
Now we will calculate the survival rate by sex and age group, to see how these two variables work together.
In [80]:
# dropna() returns a new frame rather than modifying in place, so keep its result.
sex_age_surv_copy = sex_age_surv.dropna().copy()
sex_age_surv_copy['bin'] = pd.cut(sex_age_surv_copy['Age'],np.arange(0,90,10), right=False)
print sex_age_surv_copy.head()
In [81]:
sex_age_surv_count = sex_age_surv_copy[['PassengerId', 'Sex', 'bin']].groupby(['Sex', 'bin']).count()
sex_age_surv_count['Survived'] = sex_age_surv_copy[['Survived', 'Sex', 'bin']].groupby(['Sex', 'bin']).sum()
sex_age_surv_count['SurvPerc'] = sex_age_surv_count.Survived * 100 / sex_age_surv_count.PassengerId
print sex_age_surv_count
There are no females in the 70-80 age group, which may be one of the reasons nobody survived in this bin. Now we plot female and male survival rates by age group:
In [82]:
ax = sex_age_surv_count.loc['female']['SurvPerc'].plot()
ax = sex_age_surv_count.loc['male']['SurvPerc'].plot(ax=ax, title='Percentage of female and male survivors by age')
ax.legend(['Females', 'Males'], loc='best')
ax.set_xlabel('Age')
Out[82]:
The graph shows that for children below 10, female and male survival rates are very similar. However, for the rest of the age groups, the female survival rates are high, with the minimum around 70%, and male survival rates are low, with the maximum around 20%. The data suggests that the "women and children first" code of conduct was indeed followed.
We first add a Has_Cabin boolean column to our dataset, to simplify the data wrangling operations that follow.
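The sex-by-age breakdown can also be laid out as a two-way table, with one row per age bin and one column per sex, which makes empty cells (such as a bin with no women) easy to spot. The sketch below uses made-up data and assumes a pandas version whose `groupby` accepts the `observed` argument; `toy` and `rate_table` are illustrative names:

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration (not the Titanic file).
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'male', 'female', 'male'],
    'Age':      [4, 25, 6, 30, 35, 28, 8],
    'Survived': [1, 1, 1, 0, 1, 0, 0],
})
toy['bin'] = pd.cut(toy['Age'], np.arange(0, 50, 10), right=False)

# Mean of the 0/1 Survived column per (sex, bin), then move Sex to the columns.
# observed=True keeps only the bins that actually occur in the data.
rate_table = (toy.groupby(['Sex', 'bin'], observed=True)['Survived']
                 .mean()
                 .unstack('Sex') * 100)
print(rate_table)
```

A NaN in the table means that no passenger of that sex fell in that bin, not that the survival rate was zero.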
In [83]:
titanic['Has_Cabin'] = titanic.Cabin.notnull()
cabin_surv = titanic[['PassengerId', 'Has_Cabin', 'Survived']].copy()
And now we continue:
In [84]:
cabin_surv_count = cabin_surv[['PassengerId', 'Has_Cabin']].groupby('Has_Cabin').count()
print cabin_surv_count
In [85]:
cabin_surv_count['Survived'] = cabin_surv[['Survived', 'Has_Cabin']].groupby('Has_Cabin').sum()
print cabin_surv_count
At first sight, it looks like having paid for a cabin affected the probability of survival. The problem is that most cabins were occupied by first class passengers:
In [86]:
print titanic[['PassengerId', 'Has_Cabin', 'Pclass']].groupby(['Has_Cabin', 'Pclass']).count()
and it looks like the passenger class also influenced survivability:
In [87]:
class_surv = titanic[['PassengerId', 'Pclass']].copy()
class_surv_count = class_surv.groupby('Pclass').count()
class_surv_count['Survived'] = titanic[['Survived', 'Pclass']].groupby('Pclass').sum()
print class_surv_count
So, a fairer comparison could be between the 40 first class passengers that did not pay for a cabin and the rest of the same class:
In [88]:
cabin_surv = titanic[['PassengerId', 'Has_Cabin', 'Survived']][titanic['Pclass'] == 1].copy()
cabin_surv_count = cabin_surv[['PassengerId', 'Has_Cabin']].groupby('Has_Cabin').count()
cabin_surv_count['Survived'] = cabin_surv[['Survived', 'Has_Cabin']].groupby('Has_Cabin').sum()
cabin_surv_count['SurvPerc'] = cabin_surv_count['Survived'] * 100 / cabin_surv_count['PassengerId']
print cabin_surv_count
In [89]:
ax = cabin_surv_count.SurvPerc.plot(kind='bar', rot=0, title='Percentage of first class survivors by cabin rental')
ax.set_xlabel('Had a cabin')
Out[89]:
The data shows that the survival rate of first class passengers with a cabin is about 20 percentage points higher than that of passengers of the same class who did not have a cabin.
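The same idea of comparing only within a class can be applied to all classes at once by grouping on both Pclass and Has_Cabin. This is a sketch on made-up data, not a result from the Titanic file; the frame `toy` and the names `summary` and `rates` are assumptions for illustration:

```python
import pandas as pd

# Toy data invented for illustration (not the Titanic file).
toy = pd.DataFrame({
    'Pclass':    [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    'Has_Cabin': [True, True, True, True, False, False,
                  False, False, False, False, True],
    'Survived':  [1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
})

# Group size and survival proportion for every (class, cabin) combination.
summary = toy.groupby(['Pclass', 'Has_Cabin'])['Survived'].agg(['count', 'mean'])
summary['SurvPerc'] = summary['mean'] * 100
print(summary)

# Plain dict keyed by (class, has_cabin) for easy lookups.
rates = summary['SurvPerc'].to_dict()
```

Keeping the group size next to the percentage helps when judging the comparison, since a rate computed from a handful of passengers is much less reliable than one computed from hundreds.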